Introduction

Now that we have the Bag-of-Words, we can train machine learning models. We will consider CART, Random Forest, and Boosting models.

First, let’s read in the data that contains the final Bag-of-Words.

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
reviewsBagOfWords <- readRDS("data/reviewsBagOfWords.RDS")

Next, let’s convert the review score category to an ordered factor. This is helpful when we have three or more ordered categories, because it determines how the confusion matrix is displayed in our results.

ratingLevels <- c("Terrible", "Low", "Mid", "High", "Perfect")
reviewsBagOfWords <- reviewsBagOfWords %>%
  mutate(review_scores_category = 
         as.factor(ordered(review_scores_category, levels = ratingLevels)))
## Warning: package 'bindrcpp' was built under R version 3.4.4

Next, we prefix an “x” to any variable names that begin with a digit. This is necessary for Random Forest, which cannot handle variable names that start with a number.

numericIndices <- which(grepl("^[0-9]", names(reviewsBagOfWords)))
names(reviewsBagOfWords)[numericIndices] <-
  paste0("x", names(reviewsBagOfWords)[numericIndices])

Splitting into Training/Testing

Before we begin, we first split our data into training/testing sets.

library(caTools)
## Warning: package 'caTools' was built under R version 3.4.4
set.seed(123)
X <- as.matrix(reviewsBagOfWords[,4:ncol(reviewsBagOfWords)])
y <- as.vector(reviewsBagOfWords$review_scores_category)
spl <- sample.split(y, SplitRatio = 0.75)
dataTrain <- reviewsBagOfWords[spl,] %>%
  select(-listing_id, -review_scores_rating) %>%
  rename(Y = review_scores_category)
dataTest <- reviewsBagOfWords[!spl,] %>%
  select(-listing_id, -review_scores_rating) %>%
  rename(Y = review_scores_category)
xTrain <- X[spl,]
xTest <- X[!spl,]
yTrain <- y[spl]
yTest <- y[!spl]

CART

First, fit a deep CART model (low cp value). We set a seed because the rpart() function performs cross-validation, which is a randomized algorithm.

library(rpart)
## Warning: package 'rpart' was built under R version 3.4.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.4.4
set.seed(123)
treeBig <- rpart(Y ~ ., data = dataTrain, cp = 0.001)

We can plot this initial tree:

prp(treeBig)

Next, we use the built-in cross-validation in CART to see how our model performs for varying values of cp. We can use the printcp() command to see the cross-validated error for different values of cp, where:
- "nsplit" = number of splits in tree
- "rel error" = scaled training error
- "xerror" = scaled cross-validation error
- "xstd" = standard deviation of xerror

printcp(treeBig)
## 
## Classification tree:
## rpart(formula = Y ~ ., data = dataTrain, cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] access   actual   airbnb   also     anoth    apart    beauti  
##  [8] bed      bit      boston   call     choic    clean    close   
## [15] comfort  condit   corner   das      day      definit  describ 
## [22] dirti    door     enjoy    especi   everyth  excel    extrem  
## [29] food     get      good     grate    great    help     high    
## [36] home     hospit   hot      howev    just     know     like    
## [43] locat    loud     love     made     make     man      might   
## [50] min      nearbi   need     nice     one      orang    person  
## [57] place    pretti   price    public   question quick    quiet   
## [64] rent     rental   respons  restaur  run      servic   shop    
## [71] south    space    spacious station  stay     street   tast    
## [78] thank    top      total    two      uber     use      want    
## [85] wash     wasnt    well     will     wonder  
## 
## Root node error: 1556/2122 = 0.73327
## 
## n= 2122 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.2005141      0   1.00000 1.00000 0.013093
## 2  0.0347044      1   0.79949 0.79949 0.014581
## 3  0.0122108      2   0.76478 0.76607 0.014689
## 4  0.0099614      4   0.74036 0.74679 0.014735
## 5  0.0089974      7   0.70694 0.73136 0.014763
## 6  0.0073907      8   0.69794 0.73072 0.014764
## 7  0.0051414     10   0.68316 0.72237 0.014776
## 8  0.0048201     12   0.67288 0.72044 0.014779
## 9  0.0044987     14   0.66324 0.72108 0.014778
## 10 0.0041774     15   0.65874 0.72494 0.014773
## 11 0.0038560     17   0.65039 0.72751 0.014769
## 12 0.0036418     26   0.61568 0.73008 0.014765
## 13 0.0032134     30   0.60026 0.74614 0.014737
## 14 0.0028920     31   0.59704 0.74486 0.014739
## 15 0.0025707     43   0.55334 0.74679 0.014735
## 16 0.0024100     58   0.51285 0.75514 0.014717
## 17 0.0022494     63   0.50064 0.75450 0.014718
## 18 0.0021422     69   0.48715 0.75450 0.014718
## 19 0.0019280     75   0.47429 0.75964 0.014706
## 20 0.0016710     95   0.43509 0.76542 0.014691
## 21 0.0016067    101   0.42481 0.76285 0.014698
## 22 0.0012853    105   0.41838 0.76285 0.014698
## 23 0.0010000    114   0.40617 0.77378 0.014667

Cool, so rpart automatically computes all of the cross-validated errors for trees with cp = 0.001 and up! We can also see these results visually using the plotcp command. In this plot:
- size of tree = (number of splits in tree) + 1
- the dotted line occurs at 1 std. dev. above the minimum xerror

plotcp(treeBig)

A rule of thumb is to select the cp value which first goes below the dotted line, and then prune the tree using this value.

treeFinal <- prune(treeBig, cp = 0.009)
prp(treeFinal)
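The choice cp = 0.009 follows the rule of thumb above. We can also automate that rule; here is a minimal sketch on a toy cp table (with the real model, you could read the same columns from treeBig$cptable):

```r
# Hypothetical sketch: automate the 1-SE rule of thumb on a toy cptable.
# Columns mirror the relevant part of printcp() output: CP, xerror, xstd.
cptable <- data.frame(
  CP     = c(0.20, 0.03, 0.012, 0.009, 0.005),
  xerror = c(1.00, 0.80, 0.766, 0.731, 0.722),
  xstd   = c(0.013, 0.015, 0.015, 0.015, 0.015)
)
# The dotted line in plotcp: minimum xerror + 1 std. dev.
threshold <- min(cptable$xerror) + cptable$xstd[which.min(cptable$xerror)]
# First row whose xerror falls below the dotted line
bestRow <- min(which(cptable$xerror < threshold))
bestCP  <- cptable$CP[bestRow]
bestCP
# With the real tree: treeFinal <- prune(treeBig, cp = bestCP)
```

On this toy table the rule selects cp = 0.009, matching the value used above.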

We can obtain predictions using the predict() function:

predTrain <- predict(treeFinal, newdata = dataTrain, type = "class")
predTest <- predict(treeFinal, newdata = dataTest, type = "class")

Let’s evaluate the accuracy of our model:

print("Training Set")
## [1] "Training Set"
t <- table(dataTrain$Y, predTrain)
t # Confusion Matrix
##           predTrain
##            Terrible Low Mid High Perfect
##   Terrible       57  38   0   15      64
##   Low            11 147   8  102     141
##   Mid             4  95  32  256     115
##   High            1  51  22  450      42
##   Perfect        19   4   2  110     336
sum(diag(t)) / nrow(dataTrain) # Accuracy
## [1] 0.4816211
print("Testing Set")
## [1] "Testing Set"
t <- table(dataTest$Y, predTest)
t # Confusion Matrix
##           predTest
##            Terrible Low Mid High Perfect
##   Terrible       19  15   0    1      23
##   Low             2  48   4   31      51
##   Mid             3  25   2   85      53
##   High            1  14   4  147      22
##   Perfect         8   1   0   47     101
sum(diag(t)) / nrow(dataTest) # Accuracy
## [1] 0.4483734

45% is not bad out-of-sample when you consider that there are 5 possible categories: always predicting the most common class would score only about 27% (the root node error above is 0.73327). Can the other models do better?
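The majority-class baseline mentioned here is easy to compute. A minimal sketch on a toy label vector (with the real data, substitute yTest):

```r
# Hedged sketch: baseline accuracy from always predicting the most common class.
# yToy is a made-up label vector, not taken from the review data.
yToy <- c("High", "High", "Perfect", "Mid", "High", "Low", "Perfect", "High")
baseline <- max(table(yToy)) / length(yToy)
baseline  # fraction of observations in the most frequent class
```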

Random Forest

library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Fit the forest with 200 trees, removing a couple of the words which
# contain non-alphanumeric characters.
# Technical Note: The random forest function complains if we try to include
# variable names with non-alphanumeric characters.
modelForest <- randomForest(Y ~ .,
                          data = dataTrain[,c(1:219,221:749,751:1137)],
                          ntree = 200)

Make predictions on the test set

predTrain <- predict(modelForest, newdata = dataTrain[,c(1:219,221:749,751:1137)])
predTest <- predict(modelForest, newdata = dataTest[,c(1:219,221:749,751:1137)])

Check accuracy

print("Training Set")
## [1] "Training Set"
t <- table(dataTrain$Y, predTrain)
t # Confusion Matrix
##           predTrain
##            Terrible Low Mid High Perfect
##   Terrible      173   0   0    0       1
##   Low             0 402   0    0       7
##   Mid             0   0 501    0       1
##   High            0   0   0  566       0
##   Perfect         1   0   0    0     470
sum(diag(t)) / nrow(dataTrain) # Accuracy
## [1] 0.9952875
print("Testing Set")
## [1] "Testing Set"
t <- table(dataTest$Y, predTest)
t # Confusion Matrix
##           predTest
##            Terrible Low Mid High Perfect
##   Terrible       15  20   4    2      17
##   Low             0  33  36   23      44
##   Mid             0   8  42   66      52
##   High            0   1  18  152      17
##   Perfect         3   0  14   32     108
sum(diag(t)) / nrow(dataTest) # Accuracy
## [1] 0.4950495

So Random Forest increases our out-of-sample accuracy to ~49.5%. For this model, we can print out a Variable Importance plot:

varImpPlot(modelForest)

We could also print out a sample tree, but the individual trees will be very deep and not very interpretable.
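If you do want to inspect one, the randomForest package provides getTree(). A self-contained sketch on built-in data (with the model above, replace rfToy with modelForest):

```r
library(randomForest)

# Sketch: fit a small toy forest and inspect its first tree.
# rfToy and the iris data are stand-ins, not the review model.
set.seed(123)
rfToy <- randomForest(Species ~ ., data = iris, ntree = 10)

# Each row is a node: daughter nodes, split variable/point, and prediction.
head(getTree(rfToy, k = 1, labelVar = TRUE))
```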

Boosting

Finally, let’s try a boosted tree model. We will use the package XGBoost, which is one of the top methods for winning online Kaggle competitions. Here is some more documentation for this package: https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html

There are many parameters which are important to tune for this model. These include:
- eta: the learning rate, i.e. the gradient step size in the algorithm. Small values (eta <= 0.01) will take a long time to converge, while large values converge quickly but are more prone to overfitting.
- nrounds: the number of trees in the Boosting algorithm. This should be tuned in relation to eta.
- max.depth: the maximum depth of the trees. This can be tuned for better performance, but typically very short trees or stumps perform well.
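One common way to tune nrounds is cross-validation with early stopping via xgb.cv(). A self-contained sketch on toy data (none of these objects come from the review dataset):

```r
library(xgboost)

# Hedged sketch: let cross-validation pick nrounds via early stopping.
# xToy / yToy are random toy data, not the Bag-of-Words features.
set.seed(123)
xToy <- matrix(rnorm(200), ncol = 4)
yToy <- sample(0:1, 50, replace = TRUE)

cvRes <- xgb.cv(params = list(objective = "binary:logistic",
                              eta = 0.1, max_depth = 2),
                data = xgb.DMatrix(xToy, label = yToy),
                nrounds = 50, nfold = 5,
                early_stopping_rounds = 5, verbose = 0)

cvRes$best_iteration  # suggested value of nrounds
```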

library(xgboost)
## Warning: package 'xgboost' was built under R version 3.4.4
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
X_train <- model.matrix(Y ~ ., data = dataTrain)[,-1]
X_test <- model.matrix(Y ~ ., data = dataTest)[,-1]
# For multiclass problems with K classes, XGBoost requires the labels to 
# go from 0 to K-1.  Since factor variables go from 1,...K, we subtract 1 here.
dtrain <- xgb.DMatrix(data = X_train, label = as.integer(dataTrain$Y) - 1)
dtest <- xgb.DMatrix(data = X_test, label = as.integer(dataTest$Y) - 1)
watchlist <- list(train = dtrain, test = dtest)
params <- list(
  objective = "multi:softprob",
  eval_metric = "mlogloss",
  num_class = 5, # For multiclass problems, we write here number of classes K
  eta = 0.1,
  max_depth = 2)
nthread <- 1 # Number of threads to use on the computer
nrounds <- 100

set.seed(123)
boostedTree <- xgb.train(params = params, data = dtrain,
                         nrounds = nrounds, nthread = nthread,
                         watchlist = watchlist)
## [1]  train-mlogloss:1.566804 test-mlogloss:1.574540 
## [2]  train-mlogloss:1.529574 test-mlogloss:1.543979 
## [3]  train-mlogloss:1.497253 test-mlogloss:1.518770 
## [4]  train-mlogloss:1.468007 test-mlogloss:1.494595 
## [5]  train-mlogloss:1.441120 test-mlogloss:1.473930 
## [6]  train-mlogloss:1.417165 test-mlogloss:1.456853 
## [7]  train-mlogloss:1.395467 test-mlogloss:1.440404 
## [8]  train-mlogloss:1.374466 test-mlogloss:1.424539 
## [9]  train-mlogloss:1.355929 test-mlogloss:1.413812 
## [10] train-mlogloss:1.339434 test-mlogloss:1.401887 
## [11] train-mlogloss:1.323851 test-mlogloss:1.390460 
## [12] train-mlogloss:1.309225 test-mlogloss:1.381427 
## [13] train-mlogloss:1.295569 test-mlogloss:1.372312 
## [14] train-mlogloss:1.282944 test-mlogloss:1.363736 
## [15] train-mlogloss:1.270612 test-mlogloss:1.355640 
## [16] train-mlogloss:1.258374 test-mlogloss:1.348399 
## [17] train-mlogloss:1.247609 test-mlogloss:1.341230 
## [18] train-mlogloss:1.237245 test-mlogloss:1.335660 
## [19] train-mlogloss:1.227055 test-mlogloss:1.328611 
## [20] train-mlogloss:1.217489 test-mlogloss:1.322743 
## [21] train-mlogloss:1.208410 test-mlogloss:1.316952 
## [22] train-mlogloss:1.199686 test-mlogloss:1.312633 
## [23] train-mlogloss:1.191423 test-mlogloss:1.308705 
## [24] train-mlogloss:1.183369 test-mlogloss:1.304309 
## [25] train-mlogloss:1.175521 test-mlogloss:1.299025 
## [26] train-mlogloss:1.167961 test-mlogloss:1.294592 
## [27] train-mlogloss:1.160444 test-mlogloss:1.289609 
## [28] train-mlogloss:1.153538 test-mlogloss:1.285131 
## [29] train-mlogloss:1.146878 test-mlogloss:1.281997 
## [30] train-mlogloss:1.140206 test-mlogloss:1.277410 
## [31] train-mlogloss:1.134315 test-mlogloss:1.274387 
## [32] train-mlogloss:1.127629 test-mlogloss:1.270981 
## [33] train-mlogloss:1.121957 test-mlogloss:1.267450 
## [34] train-mlogloss:1.116110 test-mlogloss:1.264185 
## [35] train-mlogloss:1.110761 test-mlogloss:1.261163 
## [36] train-mlogloss:1.105527 test-mlogloss:1.258574 
## [37] train-mlogloss:1.099923 test-mlogloss:1.255841 
## [38] train-mlogloss:1.094668 test-mlogloss:1.252924 
## [39] train-mlogloss:1.089838 test-mlogloss:1.249797 
## [40] train-mlogloss:1.084797 test-mlogloss:1.247731 
## [41] train-mlogloss:1.080003 test-mlogloss:1.244753 
## [42] train-mlogloss:1.075240 test-mlogloss:1.243580 
## [43] train-mlogloss:1.070701 test-mlogloss:1.241751 
## [44] train-mlogloss:1.066336 test-mlogloss:1.239819 
## [45] train-mlogloss:1.062062 test-mlogloss:1.237804 
## [46] train-mlogloss:1.057435 test-mlogloss:1.235940 
## [47] train-mlogloss:1.052973 test-mlogloss:1.234079 
## [48] train-mlogloss:1.049040 test-mlogloss:1.232525 
## [49] train-mlogloss:1.044791 test-mlogloss:1.230383 
## [50] train-mlogloss:1.040891 test-mlogloss:1.228033 
## [51] train-mlogloss:1.036912 test-mlogloss:1.226314 
## [52] train-mlogloss:1.032748 test-mlogloss:1.225106 
## [53] train-mlogloss:1.029062 test-mlogloss:1.224244 
## [54] train-mlogloss:1.025053 test-mlogloss:1.223026 
## [55] train-mlogloss:1.021513 test-mlogloss:1.220692 
## [56] train-mlogloss:1.017879 test-mlogloss:1.219475 
## [57] train-mlogloss:1.014563 test-mlogloss:1.218697 
## [58] train-mlogloss:1.010892 test-mlogloss:1.217827 
## [59] train-mlogloss:1.007726 test-mlogloss:1.216723 
## [60] train-mlogloss:1.004298 test-mlogloss:1.215281 
## [61] train-mlogloss:1.000969 test-mlogloss:1.214648 
## [62] train-mlogloss:0.997644 test-mlogloss:1.214504 
## [63] train-mlogloss:0.994448 test-mlogloss:1.213299 
## [64] train-mlogloss:0.991105 test-mlogloss:1.211877 
## [65] train-mlogloss:0.987848 test-mlogloss:1.210477 
## [66] train-mlogloss:0.984571 test-mlogloss:1.209312 
## [67] train-mlogloss:0.981382 test-mlogloss:1.208515 
## [68] train-mlogloss:0.978425 test-mlogloss:1.207316 
## [69] train-mlogloss:0.975413 test-mlogloss:1.206419 
## [70] train-mlogloss:0.972293 test-mlogloss:1.204307 
## [71] train-mlogloss:0.969384 test-mlogloss:1.204427 
## [72] train-mlogloss:0.966243 test-mlogloss:1.203319 
## [73] train-mlogloss:0.963548 test-mlogloss:1.202075 
## [74] train-mlogloss:0.960367 test-mlogloss:1.200667 
## [75] train-mlogloss:0.957406 test-mlogloss:1.198975 
## [76] train-mlogloss:0.954592 test-mlogloss:1.197262 
## [77] train-mlogloss:0.951789 test-mlogloss:1.196298 
## [78] train-mlogloss:0.948900 test-mlogloss:1.195658 
## [79] train-mlogloss:0.946345 test-mlogloss:1.195372 
## [80] train-mlogloss:0.943247 test-mlogloss:1.193742 
## [81] train-mlogloss:0.940740 test-mlogloss:1.193286 
## [82] train-mlogloss:0.938058 test-mlogloss:1.192671 
## [83] train-mlogloss:0.935579 test-mlogloss:1.191937 
## [84] train-mlogloss:0.932535 test-mlogloss:1.190958 
## [85] train-mlogloss:0.930082 test-mlogloss:1.189354 
## [86] train-mlogloss:0.927524 test-mlogloss:1.188316 
## [87] train-mlogloss:0.925077 test-mlogloss:1.188252 
## [88] train-mlogloss:0.922644 test-mlogloss:1.187224 
## [89] train-mlogloss:0.920334 test-mlogloss:1.187318 
## [90] train-mlogloss:0.917842 test-mlogloss:1.186612 
## [91] train-mlogloss:0.915554 test-mlogloss:1.185548 
## [92] train-mlogloss:0.913153 test-mlogloss:1.185273 
## [93] train-mlogloss:0.910597 test-mlogloss:1.184518 
## [94] train-mlogloss:0.908087 test-mlogloss:1.183620 
## [95] train-mlogloss:0.905358 test-mlogloss:1.182822 
## [96] train-mlogloss:0.903516 test-mlogloss:1.182201 
## [97] train-mlogloss:0.900974 test-mlogloss:1.182170 
## [98] train-mlogloss:0.898526 test-mlogloss:1.181829 
## [99] train-mlogloss:0.896312 test-mlogloss:1.180648 
## [100]    train-mlogloss:0.894040 test-mlogloss:1.179544

Find the probability predictions for each class:

predTrainLongVec <- predict(boostedTree, X_train)
predTestLongVec <- predict(boostedTree, X_test)
predTrainMatrix <- matrix(predTrainLongVec, ncol = 5, byrow = T)
predTestMatrix <- matrix(predTestLongVec, ncol = 5, byrow = T)
predTrain <- apply(predTrainMatrix, 1, which.max)
predTest <- apply(predTestMatrix, 1, which.max)
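Note that which.max returns class indices 1..K rather than factor labels, which is why the confusion matrices below have columns 1-5. A small optional sketch for mapping the indices back to labels (using ratingLevels as defined earlier; predToy is a made-up stand-in for predTest):

```r
# Sketch: convert which.max class indices back to the ordered labels.
ratingLevels <- c("Terrible", "Low", "Mid", "High", "Perfect")
predToy <- c(1, 4, 5, 2)  # toy index predictions; use predTest with real data
factor(ratingLevels[predToy], levels = ratingLevels)
```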

Find the accuracy:

print("Training Set")
## [1] "Training Set"
t <- table(dataTrain$Y, predTrain)
t # Confusion Matrix
##           predTrain
##              1   2   3   4   5
##   Terrible 111  19  18   2  24
##   Low       10 213  54  26 106
##   Mid        4  14 266 121  97
##   High       1   6  40 481  38
##   Perfect    4   5  12  48 402
sum(diag(t)) / nrow(dataTrain) # Accuracy
## [1] 0.6941565
print("Testing Set")
## [1] "Testing Set"
t <- table(dataTest$Y, predTest)
t # Confusion Matrix
##           predTest
##              1   2   3   4   5
##   Terrible  25  13   3   0  17
##   Low        7  30  40  19  40
##   Mid        1  17  43  64  43
##   High       0   7  28 136  17
##   Perfect    3   3  11  34 106
sum(diag(t)) / nrow(dataTest) # Accuracy
## [1] 0.4809052

Boosting reaches 48.1% accuracy out-of-sample: a clear improvement over CART, though in this run it falls slightly short of the Random Forest.

We can plot the first few trees in our XGBoost model:

# install.packages("DiagrammeR")
library(DiagrammeR)
## Warning: package 'DiagrammeR' was built under R version 3.4.3
# First Tree to predict "Terrible"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
              trees = 0)
# First Tree to predict "Low"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
              trees = 1)
# First Tree to predict "Medium"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
              trees = 2)
# First Tree to predict "High"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
              trees = 3)
# First Tree to predict "Perfect"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
              trees = 4)
# Second Tree to predict "Terrible"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
              trees = 5)

Each individual tree is reasonably interpretable. However, understanding the full model requires much more analysis, because there are 500 trees in total (100 rounds x 5 classes).

We can also compute the variable importance scores for this XGBoost model.

xgb.importance(feature_names = names(dataTrain)[2:1137], boostedTree)
##      Feature         Gain        Cover    Frequency
##   1:    good 9.889497e-02 2.139099e-02 0.0166666667
##   2:    stay 7.106747e-02 2.203154e-02 0.0140000000
##   3:  beauti 5.451984e-02 2.911681e-02 0.0260000000
##   4:   great 5.097638e-02 3.822557e-02 0.0280000000
##   5:   dirti 4.674607e-02 4.215097e-02 0.0426666667
##  ---                                               
## 289:    tidi 7.863741e-05 1.491958e-05 0.0006666667
## 290:    inde 7.638627e-05 8.951719e-06 0.0006666667
## 291:    pick 4.654224e-05 2.970406e-05 0.0006666667
## 292:   grand 3.354063e-05 2.538920e-05 0.0006666667
## 293:   green 6.029264e-06 2.072554e-05 0.0006666667